Team - Dragon, Members - Jay, Justin, Bilal, Hardik & Parnika
Introduction
Figure 1: Euoplos Rainbow. Panels (a) and (b): Golden Trapdoor Spider.
Genus Euoplos Rainbow, belonging to the family Idiopidae, is a fascinating group of armored trapdoor spiders endemic to Australia. First described by William Joseph Rainbow in 1914, this genus comprises 14 recognized species. These spiders typically range in size from 10 to 25 mm and are characterized by their heavily armored carapaces and opisthosomae. While their carapaces tend to be dark in color, their opisthosomae can display striking patterns and vibrant colors.
Euoplos spiders are nocturnal hunters, primarily active during the night, emerging from their burrows to capture prey. Their burrows, often constructed in sandy or loamy soil, feature a hinged lid that serves as both protection and a concealed trap for unsuspecting prey. They are versatile predators, preying on a variety of insects and invertebrates with their powerful fangs.
These spiders can be found in diverse habitats across Australia, including forests, woodlands, grasslands, and deserts. Their preference for dry, open habitats is particularly noteworthy.
Information about the population trends of Euoplos Rainbow is limited, largely due to the secretive nature of these spiders. Some species may face threats from habitat loss, degradation, and competition from introduced spider species, but more research is needed to determine their conservation status.
In terms of expectations for exploring data on Euoplos Rainbow sightings, some unique possibilities include:
We expect to observe a concentration of Euoplos Rainbow occurrences in subtropical eastern Australia, reflecting their abundance in this region.
We anticipate that Euoplos Rainbow sightings are more likely to occur during hot and dry weather, particularly in the summer season.
Given their nocturnal behavior, we expect the majority of sightings to occur during nighttime hours.
We expect to observe a declining trend in Euoplos Rainbow sightings due to their endangered status.
We expect that population data for Euoplos Rainbow may be limited, as is common with many spider species due to their secretive behavior, making precise assessments challenging.
These expectations provide a basis for exploring and analyzing the data on Euoplos Rainbow sightings while considering their unique biology and habitat preferences.
Data Cleaning
Load raw data
library(galah)
library(tidyverse)

# Register the email address used for ALA downloads
galah_config(email = "jaysangani04@gmail.com")

# Download all occurrence records for the taxon
Euoplos_Rainbow <- galah_call() |>
  galah_identify("Euoplos Rainbow") |>
  atlas_occurrences()

# Filter records based on date (reliable sightings after 1990)
Euoplos_Rainbow <- Euoplos_Rainbow %>%
  filter(eventDate >= as.Date("1990-01-01"))

save(Euoplos_Rainbow, file = "data-raw/Euoplos_Rainbow.rda")
Filter out unreliable sightings (BASIS_OF_RECORD_INVALID)
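# Assertion columns, keeping only records whose basis of record is not flagged invalid
Euoplos_Rainbow_assert <- galah_call() |>
  galah_identify("Euoplos Rainbow") |>
  galah_select(group = "assertions") |>
  atlas_occurrences() %>%
  filter(BASIS_OF_RECORD_INVALID != "TRUE")

# Event columns plus cl22 (renamed to State later) and basisOfRecord
Euoplos_Rainbow_event <- galah_call() |>
  galah_identify("Euoplos Rainbow") |>
  galah_select(cl22, basisOfRecord, group = "event") |>
  atlas_occurrences()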
Convert eventDate to Date format without the time component
# Convert timezone-specific times to plain hour:minute:second format
timezone_format <- grepl("\\+[0-9]{2}:[0-9]{2}|Z", Euoplos_Rainbow_event$eventTime)
Euoplos_Rainbow_event$eventTime[timezone_format] <-
  substr(Euoplos_Rainbow_event$eventTime[timezone_format], 1, 8)

# For plain hour:minute, append ":00" to make it hour:minute:second
plain_time_format <- grepl("^[0-9]{2}:[0-9]{2}$", Euoplos_Rainbow_event$eventTime)
Euoplos_Rainbow_event$eventTime[plain_time_format] <-
  paste0(Euoplos_Rainbow_event$eventTime[plain_time_format], ":00")
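The time clean-up can be tried on a small character vector without querying the ALA. The following is a minimal sketch in base R, not the code used in the analysis: `normalise_time()` and the sample values are hypothetical, and the 12-hour conversion is done by hand so that it does not depend on the locale.

```r
# Hypothetical helper: normalise mixed time strings to "HH:MM:SS".
normalise_time <- function(x) {
  x <- as.character(x)

  # Times carrying a timezone suffix ("+10:00" or "Z"): keep the first 8 characters.
  has_tz <- !is.na(x) & grepl("\\+[0-9]{2}:[0-9]{2}$|Z$", x)
  x[has_tz] <- substr(x[has_tz], 1, 8)

  # Plain "HH:MM": append seconds.
  is_hm <- !is.na(x) & grepl("^[0-9]{2}:[0-9]{2}$", x)
  x[is_hm] <- paste0(x[is_hm], ":00")

  # 12-hour clock such as "2:30 PM": convert to 24-hour.
  is_12h <- !is.na(x) & grepl("^[0-9]{1,2}:[0-9]{2} ?[AaPp][Mm]$", x)
  if (any(is_12h)) {
    hour   <- as.integer(sub(":.*$", "", x[is_12h]))
    minute <- sub("^[0-9]{1,2}:([0-9]{2}).*$", "\\1", x[is_12h])
    is_pm  <- grepl("[Pp][Mm]$", x[is_12h])
    hour24 <- as.integer(hour %% 12L + ifelse(is_pm, 12L, 0L))
    x[is_12h] <- sprintf("%02d:%s:00", hour24, minute)
  }

  x
}

sample_times <- c("14:05:00+10:00", "09:30", "2:30 PM", "23:15:42Z", NA)
normalised <- normalise_time(sample_times)
print(normalised)
```

Missing times stay `NA`, which matches the note in the variable description that several records have no eventTime.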
Perform merging and filtering
# Merging based on eventDate
# Note: merge() has no `keep.all` argument (it was silently ignored), so both
# merges are inner joins; use all = TRUE to keep unmatched rows instead.
Euoplos_Rainbow_2 <- merge(Euoplos_Rainbow, Euoplos_Rainbow_event, by = "eventDate")

# Merging the result with Euoplos_Rainbow_assert based on recordID
Euoplos_Rainbow_3 <- merge(Euoplos_Rainbow_2, Euoplos_Rainbow_assert, by = "recordID")

# Convert 12-hour format to 24-hour format
twelve_hour_format <- grepl("[APM]{2}", Euoplos_Rainbow_event$eventTime, ignore.case = TRUE)
Euoplos_Rainbow_event$eventTime[twelve_hour_format] <- format(
  strptime(Euoplos_Rainbow_event$eventTime[twelve_hour_format], format = "%I:%M %p"),
  "%H:%M:%S"
)

# Converting time from AM/PM to HH:MM:SS (parse_date_time comes from lubridate)
convert_time <- function(time) {
  if (grepl("AM|PM", time, ignore.case = TRUE)) {
    return(format(lubridate::parse_date_time(time, "h:M%p"), "%H:%M:%S"))
  }
  return(time)
}
Euoplos_Rainbow_3$eventTime <- sapply(Euoplos_Rainbow_3$eventTime, convert_time)

# Selecting necessary variables
Euoplos_Rainbow_3 <- Euoplos_Rainbow_3 %>%
  select(decimalLatitude, decimalLongitude, eventDate, scientificName,
         taxonConceptID, recordID, dataResourceName, occurrenceStatus,
         BASIS_OF_RECORD_INVALID, eventTime, basisOfRecord, cl22) %>%
  rename(State = cl22)

# Removing duplicates from the dataset
Euoplos_Rainbow_combined <- Euoplos_Rainbow_3 %>%
  distinct()
Euoplos_Rainbow_combined$eventTime <- hms::as_hms(Euoplos_Rainbow_combined$eventTime)

# Keep wild sightings only
final_Euoplos_Rainbow <- Euoplos_Rainbow_combined %>%
  filter(basisOfRecord == "HUMAN_OBSERVATION")
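Because several records can share the same eventDate, joining on the date alone pairs every record with every event on that date. The sketch below uses made-up toy data (not ALA records) to show the effect and, as one possible alternative, a join on the unique recordID. It assumes both tables carry recordID, which may not hold for the actual downloads.

```r
# Toy data (hypothetical values) with a repeated eventDate.
occ <- data.frame(
  recordID  = c("a", "b", "c"),
  eventDate = as.Date(c("2021-01-05", "2021-01-05", "2021-03-10")),
  stringsAsFactors = FALSE
)
event <- data.frame(
  recordID  = c("a", "b", "c"),
  eventDate = as.Date(c("2021-01-05", "2021-01-05", "2021-03-10")),
  eventTime = c("09:30:00", "14:00:00", NA),
  stringsAsFactors = FALSE
)

# Joining on the date alone: 2 x 2 rows for 2021-01-05 plus 1 row for 2021-03-10.
by_date <- merge(occ, event, by = "eventDate")
print(nrow(by_date))

# Joining on the unique identifier keeps one row per record;
# all.x = TRUE keeps records even when no event row matches.
by_id <- merge(occ, event[, c("recordID", "eventTime")],
               by = "recordID", all.x = TRUE)
print(nrow(by_id))
```

Note that `distinct()` only removes rows that are identical in every column, so mismatched pairs created by a date-only join are not removed by it.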
Description of Variables in the `final_Euoplos_Rainbow` Dataset
Variable_Name | Data_Type | Description
decimalLatitude | Double | Latitude at which the sighting was recorded.
decimalLongitude | Double | Longitude at which the sighting was recorded.
eventDate | Date | The date when the sighting occurred.
scientificName | Character | The scientific name of the species; consistently labeled as 'Euoplos rainbow'.
taxonConceptID | Character | A unique URL that redirects to the specific taxonomy concept on biodiversity.org.au.
recordID | Character | A distinct identifier for every record in the dataset.
dataResourceName | Character | The title of the institution or data resource provider that collected the data.
occurrenceStatus | Character | Denotes the status of the sighting, with a common value being 'PRESENT'.
BASIS_OF_RECORD_INVALID | Logical | Flag that is TRUE when the record's basis of record is invalid; flagged records are filtered out.
eventTime | Time | The precise time the event or sighting took place. Several records may have this detail omitted ('NA').
basisOfRecord | Character | How the record was made. The final dataset keeps only HUMAN_OBSERVATION so that all records are wild sightings; the original variable also includes the value PRESERVED_SPECIMEN.
Note
The Euoplos_Rainbow_combined dataset contains 1,042 entries and 12 variables. The data come from multiple sources, as shown by the distinct dataResourceName entries. The main contributor appears to be the Western Australian Museum.
To ensure the dataset’s accuracy and relevance, several processing and cleaning steps were executed:
Filtering by Date: Only sightings post-1990 were considered to ensure the relevancy and reliability of the records.
Validity Check: Entries that were deemed unreliable (marked as BASIS_OF_RECORD_INVALID) were filtered out to maintain data integrity.
Date Formatting: The eventDate variable was converted to a standard Date format for uniformity.
Time Formatting: Timezone-specific timestamps were adjusted to a standardized hour:minute:second format. Any 12-hour formatted times were converted to a 24-hour format for consistency.
Merging Data: Multiple datasets were merged based on shared variables like eventDate and recordID to create a comprehensive dataset.
Removing Duplicates: Duplicate entries were identified and removed, ensuring each record in the dataset is unique.
Saving Dataset: After cleaning and formatting, the dataset was saved as an R object to ensure ease of access and repeatability of the analysis.
The dataset was sourced using the galah package in R, which interfaces with the Atlas of Living Australia (ALA). The ALA served as the primary source of the raw data.
For the analysis in R, libraries such as tidyverse, galah, and lubridate were employed. The galah library was particularly crucial for sourcing the data directly from ALA, while lubridate was instrumental in managing date and time fields. The tidyverse collection of packages enabled data manipulation, cleaning, and visualization.
Before proceeding with any advanced analysis, potential users of this dataset should always check for missing values, outliers, or other anomalies that might affect the results. Given the geospatial nature of the data, considerations for spatial analyses or visualizations could also be relevant.
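The checks suggested above can be sketched in a few lines of base R. The data frame below is a hypothetical stand-in for a few rows of the cleaned data, and the bounding box for Australia (latitude -44 to -10, longitude 112 to 154) is an assumed rough limit, not a value taken from the analysis.

```r
# Hypothetical stand-in for a few rows of the cleaned data.
sightings <- data.frame(
  decimalLatitude  = c(-27.47, -26.65, NA, -28.33),
  decimalLongitude = c(153.03, 153.09, 153.10, 253.39),
  eventDate        = as.Date(c("2021-02-01", NA, "2020-11-15", "2022-04-03"))
)

# Count missing values per column.
missing_per_column <- colSums(is.na(sightings))

# Flag coordinates outside the assumed bounding box for Australia.
out_of_range <- with(
  sightings,
  !is.na(decimalLatitude) & !is.na(decimalLongitude) &
    (decimalLatitude < -44 | decimalLatitude > -10 |
       decimalLongitude < 112 | decimalLongitude > 154)
)

print(missing_per_column)
print(sightings[out_of_range, ])
```

Rows with a missing coordinate are reported by the first check rather than the second, so the two checks do not double count.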
Initial data analysis
Euoplos Rainbow data
Note
Data Quality Enhancement: The first stage of our data processing filters out entries with missing values in key fields, including eventDate, Longitude, and Latitude. This ensures that the subsequent geographical analyses are based on complete records.
Emphasis on Human Observations: We keep only records where basisOfRecord is explicitly labeled as “HUMAN_OBSERVATION.” This selection criterion reflects our preference for data sourced directly from human observations, which is often regarded as a more reliable source for wild sightings.
Weather data
Figure 2: Weather stations with 4 groups on map
Note
Initially, as shown in Figure 2, I divided the majority of Eastern Australia sightings into four distinct clusters and then matched each cluster with the nearest weather station. Any remaining sightings were categorized under the “Other” group. Consequently, for Group 1 through Group 4, the corresponding weather stations are Maroochydore Aero, Brisbane, Logan City Water Treatment, and Murwillumbah (Bray Park), respectively.
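Matching sightings to the nearest weather station can be sketched with a great-circle distance. This is a simplified illustration, not the clustering used above: it assigns each sighting individually rather than by cluster, and the station names, coordinates, and the helper `haversine_km()` are all hypothetical.

```r
# Great-circle distance in km between points given in decimal degrees.
haversine_km <- function(lat1, lon1, lat2, lon2) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * 6371 * asin(pmin(1, sqrt(a)))
}

# Hypothetical stations and sightings (placeholder coordinates).
stations <- data.frame(
  station = c("station_north", "station_south"),
  lat = c(-26.6, -28.3),
  lon = c(153.1, 153.4),
  stringsAsFactors = FALSE
)
sightings <- data.frame(
  recordID = c("a", "b", "c"),
  decimalLatitude = c(-26.7, -28.2, -27.0),
  decimalLongitude = c(153.0, 153.3, 153.0),
  stringsAsFactors = FALSE
)

# For each sighting, pick the station with the smallest distance.
sightings$station <- vapply(seq_len(nrow(sightings)), function(i) {
  d <- haversine_km(sightings$decimalLatitude[i], sightings$decimalLongitude[i],
                    stations$lat, stations$lon)
  stations$station[which.min(d)]
}, character(1))

print(sightings)
```

A distance cut-off could be added to send far-away sightings to an “Other” group, mirroring the grouping described above.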
Note
Subsequently, I will individually narrow down the year range for each weather station, focusing on the periods when the majority of sightings occurred. This step is essential to ensure that weather-related data, such as precipitation, maximum temperature, and minimum temperature, are relevant to the analysis. I will achieve this by first identifying the respective cluster assigned to each weather station, combining the relevant datasets, and utilizing line plots to visually depict the years when sightings were recorded for each specific weather station.
Figure 3: Year range where most sightings occur for each weather station
Note
As illustrated in Figure 3 above, the primary sightings took place during the following periods: Maroochydore station from 2016 to 2023, Brisbane station from 2020 to 2023, Logan City Water Treatment station from 2020 to 2023, and Murwillumbah station from 2018 to 2023, with a noticeable gap between 2019 and 2022. Consequently, we filter the data for all four weather stations to these year ranges and then perform a left join with the Euoplos Rainbow dataset.
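The filter-then-join step can be sketched in base R as follows. The weather values, the column names (`date`, `prcp`, `tmax`, `tmin`), and the choice of the sightings table as the left-hand side are assumptions made for illustration; the year range 2020 to 2023 is the one reported for the Brisbane station.

```r
# Hypothetical daily weather for one station (values are made up).
weather <- data.frame(
  date = as.Date(c("2015-06-01", "2020-03-14", "2021-08-02", "2022-01-20")),
  prcp = c(0, 1.2, 0, 5.4),
  tmax = c(21.0, 27.5, 22.3, 30.1),
  tmin = c(10.2, 18.0, 11.5, 20.4)
)
sightings <- data.frame(
  recordID  = c("a", "b"),
  eventDate = as.Date(c("2020-03-14", "2021-08-03")),
  stringsAsFactors = FALSE
)

# Keep only the years in which most sightings occurred.
weather_years  <- as.integer(format(weather$date, "%Y"))
weather_subset <- weather[weather_years >= 2020 & weather_years <= 2023, ]

# Left join: every sighting is kept; weather columns are NA when no day matches.
joined <- merge(sightings, weather_subset,
                by.x = "eventDate", by.y = "date", all.x = TRUE)
print(joined)
```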
1. Joining the sightings with tourism data
The plot labeled as “plot2” is a line plot that provides insights into the number of trips. It visualizes the number of trips on the y-axis while displaying quarters on the x-axis. The time frame covered in the plot ranges from 2016 to 2022, with only the first quarter (Q1) clearly displayed.
The plot highlights the contrasting tourism trends between two Australian states: New South Wales and Queensland. Here’s a more detailed analysis:
1.) Queensland: Tourism in Queensland has been consistently observed since 2016, with a gradual increase in the number of trips. Notable patterns in the data are observed:
After mid-2020 (around Q2 2020), there was a significant surge in tourism to Queensland, resulting in a peak.
Following this peak, there was a substantial decline in tourism in 2022, with the number of trips plummeting.
However, from mid-2022 onward, there appears to be a resurgence in tourism, although it hasn’t fully recovered to previous levels.
2.) New South Wales: In contrast to Queensland, tourism in New South Wales only became evident after 2020 Q1. Key observations include:
A remarkable and rapid increase in tourism was noted, peaking around 2022 Q1.
After this peak, there was a downturn in the number of trips.
These findings suggest that Queensland has been a consistent tourist destination since 2016, with fluctuations, including a significant drop in 2022. New South Wales saw a surge in tourism after 2020 Q1, reaching its zenith around 2022 Q1, followed by a decline.
The reasons behind these trends in tourism may include various factors such as seasonal variations, regional events, economic factors, or shifts in travel preferences. Further analysis and domain-specific knowledge may be necessary to better understand these patterns.
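Joining sightings with quarterly tourism data requires both tables to share a quarter key. The sketch below shows one way to build that key in base R. The tourism table, its column names (`Quarter`, `State`, `Trips`), and all values are hypothetical and do not come from the tourism dataset used here.

```r
# Hypothetical inputs.
sightings <- data.frame(
  eventDate = as.Date(c("2020-02-10", "2020-03-01", "2021-07-15")),
  State = c("Queensland", "Queensland", "New South Wales"),
  stringsAsFactors = FALSE
)
tourism <- data.frame(
  Quarter = c("2020 Q1", "2021 Q3", "2021 Q3"),
  State = c("Queensland", "Queensland", "New South Wales"),
  Trips = c(100, 120, 90),
  stringsAsFactors = FALSE
)

# Label each sighting with its year and quarter, e.g. "2020 Q1".
sightings$Quarter <- paste(format(sightings$eventDate, "%Y"),
                           quarters(sightings$eventDate))

# Count sightings per state and quarter.
counts <- aggregate(
  list(n_sightings = rep(1L, nrow(sightings))),
  by = list(Quarter = sightings$Quarter, State = sightings$State),
  FUN = sum
)

# Keep every tourism row; quarters without sightings get a count of zero.
combined <- merge(tourism, counts, by = c("Quarter", "State"), all.x = TRUE)
combined$n_sightings[is.na(combined$n_sightings)] <- 0L
print(combined)
```

Aggregating sightings before the join keeps one row per state and quarter, which avoids repeating the same trip count once per sighting.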
Exploratory data analysis
1. Most sightings occur in Eastern Australia
Figure 4: All sightings on Australia map
Note
As depicted in Figure 4, our initial expectation that most sightings would occur in Eastern Australia has been confirmed, with the majority concentrated in Queensland and New South Wales, specifically around Brisbane and the Gold Coast. Notably, there have been two natural observations of Euoplos Rainbow in South Australia.
2. Most sightings occur during hot and dry weather, particularly in the summer season.
Maroochydore aero
Figure 5: Maroochydore station weather on successful sighting days
Brisbane
Figure 6: Brisbane station weather on successful sighting days
Logan city water treatment
Figure 7: Logan city station weather on successful sighting days
Murwillumbah (bray park)
Figure 8: Murwillumbah station weather on successful sighting days
Note
As observed in the four plots displayed above, namely Figure 5, Figure 6, Figure 7, and Figure 8, a consistent weather pattern emerges on days when sightings of Euoplos Rainbow occur. These patterns indicate predominantly dry weather with minimal to no rainfall, aligning with our initial expectations. However, when examining the minimum and maximum temperatures, they typically fall within the ranges of 10 to 20 degrees Celsius for minimum temperature and 20 to 30 degrees Celsius for maximum temperature. This observation contrasts with our initial assumption of hot weather and is more akin to the conditions typically experienced during the fall season.
3. The majority of sightings occur during nighttime hours.
Figure 9: Distribution of Spider Sightings by Hour of the Day
Note
As depicted in Figure 9, the majority of Euoplos Rainbow sightings surprisingly occur during daytime hours, from 8:00 am to 3:00 pm, which contradicts our initial assumption of predominantly nighttime activity. However, this finding may reflect observer behavior rather than spider behavior: fewer people venture outdoors at night, which reduces the chance of spotting a trapdoor spider. Consequently, we maintain our initial expectation that these spiders are primarily active at night.
4. We expect to observe a declining trend in Euoplos Rainbow sightings due to their endangered status.
Figure 10: Number of successful sightings over time
Note
As illustrated in Figure 10, the number of sightings is relatively low until late 2019, then rises sharply and peaks in 2021. This surge coincides with the Covid-19 pandemic, and one possible explanation is reduced human interference in natural habitats during that period, which may have allowed many species, including Euoplos Rainbow, to return to the areas where they are typically found. The trend therefore does not align with the expected decrease in sightings, which is positive news for Euoplos Rainbow and its habitat.
5. We expect that population data for Euoplos Rainbow may be limited, as is common with many spider species due to their secretive behavior, making precise assessments challenging.
Note
This can be readily confirmed by examining the dataset for Euoplos Rainbow, which contains a total of 94 rows. These observations represent wild sightings only, excluding any data from museums, historical records, or animals living in sanctuaries or zoos. Compared with the 372 sightings in the initial dataset, which encompasses both wild and non-wild observations, the dataset is clearly limited in scope.
Temporal Data Analysis
The line chart provides a comprehensive view of the number of trips by event year, facilitating our temporal analysis. Queensland was selected as the focal state due to the majority of spider sightings in that area. We further narrowed our focus to specific regions, including Brisbane City, Brisbane Airport, and Logan Village, where spider sightings were prevalent. The “Purpose” variable is used to categorize and distinguish the various reasons for these trips.
Intriguingly, the data for Brisbane Airport reveals a consistent trend of almost zero trips recorded between 2016 and 2022. This unusual observation may raise questions about the accessibility or specific activities in this area.
In contrast, Brisbane City appears to be the preferred destination for the majority of trips, encompassing a wide range of purposes. It’s worth noting a significant drop in the number of trips at the beginning of 2020, potentially attributed to the global impact of the COVID-19 pandemic. However, as restrictions eased, spider sightings surged throughout the year.
Between 2016 and 2022, the primary purpose for travel, after the “Total” category, was “Business.” This indicates a strong link between business activities and trips during this period. Following closely were “Visiting” and “Holiday,” while the “Others” category had the fewest recorded trips.
Conversely, Logan Village exhibited a slightly higher number of trips compared to Brisbane Airport across all years. Further exploration and contextual information could shed light on the specific factors influencing travel patterns in these regions.
EcoTourism Analysis
Figure 11: Tourism Patterns Across Recent Quarters
Figure 11 shows the domestic trend of trips in the states of New South Wales and Queensland in the recent quarters.
Note
The plot illustrates the quarterly pattern of tourism from 2016 to 2022. Tourism shows a broadly increasing trend, and the number of trips stays above 5 million in every quarter. There was a slight drop in the first quarter of 2020 during the COVID-19 outbreak, but domestic trips rose again in Q3 2020 despite lockdowns.
Figure 12: Sightings Patterns Across Recent Quarters
Figure 12 shows the rare sightings of Golden Trapdoor Spider across the recent quarters.
Note
This scatter plot shows Golden Trapdoor Spider sightings across quarters. Sightings remained low until 2020 and then rose sharply during the COVID-19 period, which suggests that these spiders are seen more often when human activity is reduced.
Figure 13: Regression Plot
Figure 13 uses simple linear regression to show the association between Sightings and Trips. For this analysis, we considered Sightings as the response variable and Tourism as the explanatory variable, meaning we use tourism data to predict the number of spider sightings. The graph shows no clear association between the two variables, and we do not think that Sightings and Tourism share any causal relationship.
Let's drill further using a hypothesis test to check the validity of our assumption.
term | estimate | std.error | statistic | p.value
(Intercept) | 25440393.48 | 45618.709 | 557.67456 | 0
sightings | -39928.04 | 2234.177 | -17.87148 | 0
[1] 0.000000e+00 4.440349e-71
H0 : There is no linear association between Tourism and Spider Sightings (the slope is zero).
α : 0.05 (Industry Standard)
p-value : 0.000000e+00, 4.440349e-71 (Calculated)
Note
Both p-values are far below α, so we have sufficient evidence to reject the null hypothesis: the fitted slope is statistically distinguishable from zero. Statistical significance is not the same as a strong relationship, however. Even a very small effect can produce a tiny p-value when the number of rows is large, so the strength of the association between Golden Trapdoor Spider sightings and tourism has to be judged from the effect size rather than from the p-value alone.
[1] -0.1004898
We can also compute the correlation between sightings and trips using the cor function, which gives a very weak negative correlation of about -0.1. The two variables have only a slight tendency to move in opposite directions, and the relationship is not strong: a correlation of this size corresponds to roughly 1% of shared variation (r² ≈ 0.01).
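The difference between a small p-value and a strong relationship can be illustrated on simulated data. The sketch below is not the tourism data: the sample size, coefficients, and noise level are assumptions chosen so that the true correlation is close to -0.1.

```r
set.seed(1)
n <- 10000

# Simulated predictor and a response that depends on it only weakly.
sightings <- rpois(n, lambda = 3)
trips <- 100 - 0.6 * sightings + rnorm(n, sd = 10)

# Simple linear regression and the p-value of the slope.
fit <- lm(trips ~ sightings)
slope_p <- summary(fit)$coefficients["sightings", "Pr(>|t|)"]

# Pearson correlation and the share of variance explained.
r <- cor(sightings, trips)

print(slope_p)  # very small because n is large
print(r)        # weak negative correlation
print(r^2)      # small share of variance explained
```

With many observations the slope is clearly different from zero, yet the predictor explains only a small share of the variation in the response, which is why both the p-value and the effect size are worth reporting.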
Figure 14: Residual Plot
Figure 14 shows that there is only a weak relationship between spider sightings and tourism.
Note
Key Takeaways: After all our checks, we conclude that there is no meaningful relationship between tourism and spider sightings.
Summary
Note
In summary, our exploratory data analysis (EDA) of Euoplos Rainbow sightings has revealed several intriguing findings, often challenging our initial expectations:
Geographical Distribution: Our initial expectation of a concentration of sightings in eastern Australia, particularly in Queensland and New South Wales, has been validated. The presence of Euoplos Rainbow in South Australia was also a noteworthy discovery.
Weather Patterns: The weather conditions on days when Euoplos Rainbow sightings occur are predominantly dry with minimal rainfall, aligning with our initial anticipation. However, the temperature range is more moderate than expected, resembling conditions typical of the fall season.
Activity Hours: Surprisingly, Euoplos Rainbow sightings are more common during daytime hours, which contradicts our initial assumption of nocturnal behavior. This observation is explained by reduced human activity outdoors at night.
Temporal Trends: Instead of a decreasing trend in sightings, as expected for an endangered species, we observed a surge in sightings during the peak of the Covid-19 pandemic. Reduced human interference in natural habitats likely allowed these spiders to thrive.
Limited Data: The dataset for Euoplos Rainbow consists mainly of wild sightings, and when compared to the broader dataset, it becomes clear that data availability for this species is limited.
These findings underscore the complexity of species behavior and the influence of external factors. While some expectations were confirmed, others challenged our assumptions, emphasizing the importance of data-driven insights in ecological research.
References
Australian Faunal Directory. (n.d.). Euoplos. https://biodiversity.org.au/afd/taxa/Euoplos
Atlas of Living Australia. (n.d.). Euoplos. https://bie.ala.org.au/species/https://biodiversity.org.au/afd/taxa/1b5cd7fc-fed7-4788-ac39-b33cafc7bbb4
Australian Spiders in Colour. (n.d.). Spider Identification. https://www.termite.com.au/spider-identification.html
Find-a-spider Guide. (n.d.). A Photographic Guide to Australian Spiders. http://www.findaspider.org.au/find/spiders/409.htm
Package Citation
tidyverse
Wickham H, Averick M, Bryan J, Chang W, McGowan LD, François R, Grolemund G, Hayes A, Henry L, Hester J, Kuhn M, Pedersen TL, Miller E, Bache SM, Müller K, Ooms J, Robinson D, Seidel DP, Spinu V, Takahashi K, Vaughan D, Wilke C, Woo K, Yutani H (2019). “Welcome to the tidyverse.” Journal of Open Source Software, 4(43), 1686. doi: 10.21105/joss.01686 (URL: https://doi.org/10.21105/joss.01686).
galah
Westgate M, Stevenson M, Kellie D, Newman P (2023). galah: Atlas of Living Australia (ALA) Data and Resources in R. R package version 1.5.2, <URL: https://CRAN.R-project.org/package=galah>.
visdat
Tierney N (2017). “visdat: Visualising Whole Data Frames.” JOSS, 2(16), 355. doi: 10.21105/joss.00355 (URL: https://doi.org/10.21105/joss.00355).
rnoaa
Scott Chamberlain and Daniel Hocking (2023). rnoaa: ‘NOAA’ Weather Data from R. R package version 1.4.0. https://CRAN.R-project.org/package=rnoaa
lubridate
Garrett Grolemund, Hadley Wickham (2011). Dates and Times Made Easy with lubridate. Journal of Statistical Software, 40(3), 1-25. URL https://www.jstatsoft.org/v40/i03/.
ozmaps
Michael Sumner (2021). ozmaps: Australia Maps. R package version 0.4.5. https://CRAN.R-project.org/package=ozmaps
patchwork
Thomas Lin Pedersen (2023). patchwork: The Composer of Plots. R package version 1.1.3. https://CRAN.R-project.org/package=patchwork
ggrepel
Kamil Slowikowski (2023). ggrepel: Automatically Position Non-Overlapping Text Labels with ‘ggplot2’. R package version 0.9.3. https://CRAN.R-project.org/package=ggrepel
colorspace
Zeileis A, Fisher JC, Hornik K, Ihaka R, McWhite CD, Murrell P, Stauffer R, Wilke CO (2020). “colorspace: A Toolbox for Manipulating and Assessing Colors and Palettes.” Journal of Statistical Software, 96(1), 1-49. doi: 10.18637/jss.v096.i01 (URL: https://doi.org/10.18637/jss.v096.i01).
plotly
C. Sievert. Interactive Web-Based Data Visualization with R, plotly, and shiny. Chapman and Hall/CRC Florida, 2020.
kableExtra
Hao Zhu (2021). kableExtra: Construct Complex Table with ‘kable’ and Pipe Syntax. R package version 1.3.4. https://CRAN.R-project.org/package=kableExtra